Colombia COVID-19

LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset

DESCRIPTION: The Coronavirus (COVID-19) outbreak reached Colombia with the first confirmed case in the country on March 6th. Since then, the number of confirmed cases has been increasing and the first deaths related to the virus have been confirmed. This data set contains complete information about confirmed cases, deaths and number of recovered patients according to the daily reports by the Colombian health department (Ministerio de Salud).

GOAL: Build a model for the number of confirmed cases in the different regions of Colombia. You have access to some covariates for the individual cases, such as Edad (age), Sexo (sex) and Pais de procedencia (country of origin). Try to test the predictive accuracy of your selected model.

ATTENTION: Three countries are here considered: Colombia, Mexico and India. Each different group of students should focus on a geographical sub-area of one of the three countries, say the northern, the central or the southern part of the countries, by pooling all the regions/states/departments belonging to the considered area. Say: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction in northern, central and southern is not strict, you have some flexibility.


Our Project

We decided to do central Colombia because it is where the capital is.

The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).

Dataset - First Exploration

##  [1] "Bogotá D.C."           "Valle del Cauca"       "Antioquia"            
##  [4] "Cartagena D.T. y C"    "Huila"                 "Meta"                 
##  [7] "Risaralda"             "Norte de Santander"    "Caldas"               
## [10] "Cundinamarca"          "Barranquilla D.E."     "Santander"            
## [13] "Quindío"               "Tolima"                "Cauca"                
## [16] "Santa Marta D.T. y C." "Cesar"                 "San Andrés"           
## [19] "Casanare"              "Nariño"                "Atlántico"            
## [22] "Boyacá"                "Córdoba"               "Bolívar"              
## [25] "Sucre"                 "La Guajira"
## # A tibble: 26 x 1
##    `Departamento o Distrito`
##    <chr>                    
##  1 Antioquia                
##  2 Atlántico                
##  3 Barranquilla D.E.        
##  4 Bogotá D.C.              
##  5 Bolívar                  
##  6 Boyacá                   
##  7 Caldas                   
##  8 Cartagena D.T. y C       
##  9 Casanare                 
## 10 Cauca                    
## # … with 16 more rows

Colombia is divided into 32 departments. According to Wikipedia, the dataset is missing the Departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés and Vichada.

Bogotá, Distrito Capital, is in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.

The Districts (Spanish: Distritos) in Colombia are cities with a distinguishing feature, such as their location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

We are missing Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

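# Latitude, longitude and number of cases for each department/district
# bogota <- c(4.592164298, -74.072166378, 542)  # unused: Bogotá is the first row of gis_data below
valle_cauca<-c(3.359889, -76.638565, 162) # Cauca is the same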
antioquia<-c(6.230833, -75.590553, 127)
cartagena<-c(10.39972, -75.51444, 39)
huila<-c(2.916112, -75.283440, 30)
meta<-c(4.151382, -73.637688, 12)
risaralda<-c(4.8133302,  -75.6961136, 35)
norte_santander<-c(7.916663, -72.666664, 21)
caldas<-c(5.070275, -75.513817, 16)
cudinamarca<-c(4.862437, -74.058655, 42)
barraquilla<-c(10.9583295, -74.791163502, 35) #atlantico
santader<-c(7.12539, -73.1198, 12)
quindio<-c(4.535000, -75.675690, 23)
tolima<-c(4.43889, -75.23222, 14)
santa_marta<-c(11.24079, -74.19904, 12)
cesar<-c(10.46314, -73.25322, 16)
san_andres<-c(12.542499, -81.718369, 2)
casanare<-c(5.33775, -72.39586, 2)
narino<-c(1.21361, -77.28111, 6)
boyaca<-c(5.767222, -72.940651, 6)
cordoba<-c(8.74798, -75.88143, 2)
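bolivar<-c(4.3387, -76.18342, 3) # NOTE: these look like the coordinates of the town of Bolívar in Valle del Cauca, not of the Bolívar department; to be checked
sucre<-c(8.81136, -74.72084, 1)
guajira<-c(7.370165186, -76.7088304, 1) # NOTE: La Guajira is the northernmost department; these coordinates fall well south of it and should be checked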

gis_data<-data.frame(latitude=4.624335, longitude=-74.063644, cases=542) # Bogotá
gis_data<-rbind(gis_data, valle_cauca)
gis_data<-rbind(gis_data, antioquia)
gis_data<-rbind(gis_data, cartagena)
gis_data<-rbind(gis_data, huila)
gis_data<-rbind(gis_data, meta)
gis_data<-rbind(gis_data, risaralda)
gis_data<-rbind(gis_data, norte_santander)
gis_data<-rbind(gis_data, caldas)
gis_data<-rbind(gis_data, cudinamarca)
gis_data<-rbind(gis_data, barraquilla)
gis_data<-rbind(gis_data, santader)
gis_data<-rbind(gis_data, quindio)
gis_data<-rbind(gis_data, tolima)
gis_data<-rbind(gis_data, santa_marta)
gis_data<-rbind(gis_data, cesar)
gis_data<-rbind(gis_data, san_andres)
gis_data<-rbind(gis_data, casanare)
gis_data<-rbind(gis_data, narino)
gis_data<-rbind(gis_data, boyaca)
gis_data<-rbind(gis_data, cordoba)
gis_data<-rbind(gis_data, bolivar)
gis_data<-rbind(gis_data, sucre)
gis_data<-rbind(gis_data, guajira)

gis_data
##     latitude longitude cases
## 1   4.624335 -74.06364   542
## 2   3.359889 -76.63856   162
## 3   6.230833 -75.59055   127
## 4  10.399720 -75.51444    39
## 5   2.916112 -75.28344    30
## 6   4.151382 -73.63769    12
## 7   4.813330 -75.69611    35
## 8   7.916663 -72.66666    21
## 9   5.070275 -75.51382    16
## 10  4.862437 -74.05866    42
## 11 10.958329 -74.79116    35
## 12  7.125390 -73.11980    12
## 13  4.535000 -75.67569    23
## 14  4.438890 -75.23222    14
## 15 11.240790 -74.19904    12
## 16 10.463140 -73.25322    16
## 17 12.542499 -81.71837     2
## 18  5.337750 -72.39586     2
## 19  1.213610 -77.28111     6
## 20  5.767222 -72.94065     6
## 21  8.747980 -75.88143     2
## 22  4.338700 -76.18342     3
## 23  8.811360 -74.72084     1
## 24  7.370165 -76.70883     1

The pin color reflects the number of cases: “green” for fewer than 10, “orange” for fewer than 100, and “red” otherwise.
The map shows every city/department for which we have data. Note that we have no data for the south of the country.
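The coloring rule can be written as a small helper. This is only a sketch of the thresholds stated above; the function name `pin_color` is hypothetical, and how the colors are passed to the mapping package is left out.

```r
# Map a case count to a pin color: < 10 green, < 100 orange, otherwise red.
# pin_color is an illustrative name, not part of the original notebook.
pin_color <- function(cases) {
  ifelse(cases < 10, "green",
         ifelse(cases < 100, "orange", "red"))
}

# A few case counts taken from the gis_data table above.
cases <- c(542, 162, 39, 6, 1)
colors <- pin_color(cases)
print(data.frame(cases = cases, color = colors))
```

Since `ifelse` is vectorized, the helper can be applied directly to the whole `cases` column of `gis_data`.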

From what I have read, Colombia is divided into 5 regions; the central one comprises Boyacá, Tolima, Cundinamarca, Meta and Bogotá D.C.

ANGELA: Looking at Wikipedia, I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would also add Arauca, Casanare and Vichada. Note that we only have Casanare; the other two have no data.

However, since our assignment divides Colombia into 3 parts, I think that we should add some more departments (e.g. Quindío, Valle del Cauca, Risaralda, Caldas, Boyacá and possibly Antioquia and Santander).
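Once the list of departments is settled, selecting the sub-area is a simple filter. The sketch below uses a toy data frame and hypothetical names (`central`, `cases_df`, `departamento`); the list of departments is only the proposal discussed above, not a final choice.

```r
# Departments proposed above for the central area (still under discussion).
central <- c("Bogotá D.C.", "Boyacá", "Tolima", "Cundinamarca", "Meta",
             "Casanare", "Quindío", "Valle del Cauca", "Risaralda", "Caldas")

# Toy stand-in for the case table; only the department column matters here.
cases_df <- data.frame(
  id = 1:5,
  departamento = c("Bogotá D.C.", "Antioquia", "Meta", "Cesar", "Caldas"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose department belongs to the central area.
central_df <- cases_df[cases_df$departamento %in% central, ]
print(central_df)
```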

## # A tibble: 979 x 9
##    `ID de caso` `Fecha de diagn… `Ciudad de ubic… `Departamento o… `Atención**`
##           <dbl> <chr>            <chr>            <chr>            <chr>       
##  1            1 06/03/2020       Bogotá           Bogotá D.C.      Recuperado  
##  2            2 09/03/2020       Buga             Valle del Cauca  Recuperado  
##  3            3 09/03/2020       Medellín         Antioquia        Recuperado  
##  4            4 11/03/2020       Medellín         Antioquia        Recuperado  
##  5            5 11/03/2020       Medellín         Antioquia        Recuperado  
##  6            6 11/03/2020       Itagüí           Antioquia        Casa        
##  7            8 11/03/2020       Bogotá           Bogotá D.C.      Recuperado  
##  8            9 11/03/2020       Bogotá           Bogotá D.C.      Recuperado  
##  9           10 12/03/2020       Bogotá           Bogotá D.C.      Recuperado  
## 10           11 12/03/2020       Bogotá           Bogotá D.C.      Casa        
## # … with 969 more rows, and 4 more variables: Edad <dbl>, Sexo <chr>,
## #   `Tipo*` <chr>, `País de procedencia` <chr>

Some very basic plots

Let’s check the situation (and also the power of ggplot)!

Scattered information about the pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that the effects of the quarantine are not yet visible in our data.

  • on the 26th of March, the machine that prepared the samples for processing and subsequent diagnosis of COVID-19 broke down, which slowed the rate at which results were produced. This could explain the very low number of confirmed cases on that day (the cumulative count below rises only from 481 to 491).


Most of the cases are in the capital, Bogotá.


The previous plot represents the daily incidence of the disease across all the departments we are taking into account.

Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):

##    Fecha de diagnóstico Cumulative confirmed
## 1            2020-03-06                    1
## 2            2020-03-09                    3
## 3            2020-03-11                    9
## 4            2020-03-12                   11
## 5            2020-03-13                   16
## 6            2020-03-14                   24
## 7            2020-03-15                   45
## 8            2020-03-16                   57
## 9            2020-03-17                   75
## 10           2020-03-18                  102
## 11           2020-03-19                  128
## 12           2020-03-20                  175
## 13           2020-03-21                  210
## 14           2020-03-22                  240
## 15           2020-03-23                  306
## 16           2020-03-24                  419
## 17           2020-03-25                  481
## 18           2020-03-26                  491
## 19           2020-03-27                  539
## 20           2020-03-28                  603
## 21           2020-03-29                  700
## 22           2020-03-30                  798
## 23           2020-03-31                  906
## 24           2020-04-01                 1065
## 25           2020-04-02                 1161
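A table like the one above can be obtained from the per-case diagnosis dates by counting cases per day and taking a running sum. The sketch below uses a handful of toy dates in the dataset's dd/mm/YYYY format, so its totals do not match the real table; the variable names are illustrative.

```r
# Toy diagnosis dates, one per case, in dd/mm/YYYY format.
fecha <- c("06/03/2020", "09/03/2020", "09/03/2020",
           "11/03/2020", "11/03/2020", "11/03/2020")

dates <- as.Date(fecha, format = "%d/%m/%Y")

# Daily incidence: number of cases diagnosed on each date.
daily <- table(dates)

# Cumulative confirmed cases: running sum of the daily incidence.
cumulative <- data.frame(
  date = as.Date(names(daily)),
  confirmed = cumsum(as.vector(daily))
)
print(cumulative)
```

Days without any diagnosis simply do not appear, which is also the case in the table above (for example, there is no row for 2020-03-07).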

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

To confirm this, we should fit a log-linear model and check that the logarithm of the cumulative count is roughly a straight line in time, i.e. that the growth rate is constant.
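A minimal sketch of that check, using the cumulative counts copied from the table above. It assumes a plain least-squares fit of log(confirmed) on the number of days since the first case; this is one simple choice among several, not a settled modelling decision.

```r
# Dates and cumulative confirmed cases copied from the table above.
dates <- c(as.Date("2020-03-06"), as.Date("2020-03-09"),
           seq(as.Date("2020-03-11"), as.Date("2020-04-02"), by = "day"))
confirmed <- c(1, 3, 9, 11, 16, 24, 45, 57, 75, 102, 128, 175, 210, 240,
               306, 419, 481, 491, 539, 603, 700, 798, 906, 1065, 1161)

# Days elapsed since the first confirmed case.
day <- as.numeric(dates - dates[1])

# Log-linear model: log(confirmed) = a + r * day, where r is the growth rate.
fit <- lm(log(confirmed) ~ day)
growth_rate <- unname(coef(fit)["day"])

# Under exponential growth the doubling time is log(2) / r.
doubling_time <- log(2) / growth_rate

print(summary(fit)$r.squared)
print(c(growth_rate = growth_rate, doubling_time = doubling_time))
```

Plotting `log(confirmed)` against `day` together with the fitted line shows whether the straight-line assumption holds over the whole period or only over part of it.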

Now let’s explore the distribution of cases across genders and age:


To study the age distribution we should probably split the ages into non-overlapping groups, for example 0-18, 19-30, 31-45, 46-60, 61-75, 76+.
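Grouping ages into bins can be done with `cut()`. The sketch below assumes non-overlapping, right-closed bins (so an 18-year-old falls in the first group) and uses a toy `edad` vector rather than the real column.

```r
# Toy ages chosen to sit on both sides of every bin boundary.
edad <- c(5, 18, 19, 30, 31, 45, 46, 60, 61, 75, 76, 90)

# Right-closed intervals: (-Inf,18], (18,30], (30,45], (45,60], (60,75], (75,Inf)
age_group <- cut(edad,
                 breaks = c(-Inf, 18, 30, 45, 60, 75, Inf),
                 labels = c("0-18", "19-30", "31-45", "46-60", "61-75", "76+"))

print(table(age_group))
```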


This is quite surprising. I expected older people to be more affected by COVID-19!

The overall life expectancy at birth in Colombia is 74.8 years (71.2 years for males and 78.4 years for females), according to Wikipedia.

On the other hand, the median age of the population was 29.5 years in 2015 (30.4 in 2018, 31.3 in 2020), so it is a country full of young people!

Now we can jointly analyze the distribution of age and sex (the sex distribution across age groups):

##    age_group Sexo count
## 1       0-18    F   -20
## 2      19-30    F  -119
## 3      31-45    F  -149
## 4      46-60    F  -113
## 5      60-75    F   -69
## 6        76+    F   -11
## 7       0-18    M    23
## 8      19-30    M   119
## 9      31-45    M   154
## 10     46-60    M   124
## 11     60-75    M    61
## 12       76+    M    17
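The negative counts for F in the table above are consistent with the usual trick for drawing a population pyramid, where one sex is plotted to the left of zero. Assuming that is what happened here, the sketch below rebuilds the signed counts from the absolute values and checks the totals; the data frame name `pyramid` is illustrative.

```r
# Absolute counts copied from the table above (age labels kept as printed).
pyramid <- data.frame(
  age_group = rep(c("0-18", "19-30", "31-45", "46-60", "60-75", "76+"), times = 2),
  Sexo = rep(c("F", "M"), each = 6),
  count = c(20, 119, 149, 113, 69, 11,
            23, 119, 154, 124, 61, 17)
)

# Negate the female counts so that the two sexes point in opposite directions.
pyramid$signed <- ifelse(pyramid$Sexo == "F", -pyramid$count, pyramid$count)

# Totals per sex, computed on the absolute counts.
totals <- tapply(pyramid$count, pyramid$Sexo, sum)
print(totals)
```

The totals are 481 women and 498 men, which add up to the 979 cases in the dataset.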

There isn’t much difference between the sexes across the age groups. I have the impression that the covariates present in the dataset won’t help us!! :(

We are now left to explore the Tipo variable:


I think that en estudio means that it is not yet clear whether the case is imported or not. In any case, there seem to be more imported cases; we can count them:

## # A tibble: 3 x 3
##   tipo        total_number percentage
##   <chr>              <int> <chr>     
## 1 Relacionado          281 28.7%     
## 2 Importado            465 47.5%     
## 3 En estudio           233 23.8%
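The percentages can be reproduced from the counts alone. This sketch starts from the three totals in the table above rather than from the raw Tipo column, and the variable names are illustrative.

```r
# Counts per case type, copied from the table above.
tipo_counts <- c(Relacionado = 281, Importado = 465, `En estudio` = 233)

# Share of each type, formatted with one decimal place.
percentage <- sprintf("%.1f%%", 100 * tipo_counts / sum(tipo_counts))

tipo_table <- data.frame(
  tipo = names(tipo_counts),
  total_number = unname(tipo_counts),
  percentage = percentage
)
print(tipo_table)
```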

Almost half of the total confirmed cases in our region of interest are imported, and for a significant percentage it is unknown whether they are imported or not. Again, this is interesting in some sense, but I don’t see clearly why it should be helpful in our model!

## # A tibble: 55 x 2
## # Groups:   País de procedencia [55]
##    `País de procedencia`     n
##    <chr>                 <int>
##  1 0                         1
##  2 Alemania                  4
##  3 Alemania - Estambul       1
##  4 Arabia                    1
##  5 Argentina                 2
##  6 Aruba                     2
##  7 Bélgica                   1
##  8 Brasil                   10
##  9 Canadá                    1
## 10 Chile                     2
## # … with 45 more rows

Here the data are a bit dirty; however, I don’t know if the effort of cleaning them will be worth the result. It depends on whether we decide to use this information in our analysis.
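If we do decide to use this column, a first cleaning pass could look like the sketch below. It rests on two assumptions that would need checking against the data: that "0" is a placeholder for a missing value, and that entries such as "Alemania - Estambul" list several places separated by " - ". The toy vector `pais` stands in for the real column.

```r
# Toy values in the style of the País de procedencia column.
pais <- c("0", "Alemania", "Alemania - Estambul", "Brasil", "Brasil")

# Assumption: "0" means that no origin was recorded.
pais[pais == "0"] <- NA

# Assumption: " - " separates several places visited by the same case.
places <- unlist(strsplit(pais[!is.na(pais)], " - ", fixed = TRUE))

# Number of mentions per place, most frequent first.
place_counts <- sort(table(places), decreasing = TRUE)
print(place_counts)
```

Note that some entries are cities rather than countries (Estambul here), so a mapping from place to country would still be needed afterwards.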

Missing

I haven’t yet integrated the “other part” of the dataset, the one concerning deaths!

Ideas

As for the predictive model we want to build, I think that we should start with something very simple (e.g. a (log-)linear model) and take it as a baseline.

Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.
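One possible way to measure the improvement over the baseline is to hold out the last few days and compare prediction errors. The sketch below does this for the log-linear baseline only, using the cumulative counts from the table in the exploration section. The 20/5 split and the use of RMSE are assumptions made for illustration; any richer model could be scored on the same held-out days.

```r
# Cumulative confirmed cases and days since 2020-03-06, from the table above.
confirmed <- c(1, 3, 9, 11, 16, 24, 45, 57, 75, 102, 128, 175, 210, 240,
               306, 419, 481, 491, 539, 603, 700, 798, 906, 1065, 1161)
day <- c(0, 3, 5:27)
df <- data.frame(day = day, confirmed = confirmed)

# Time-ordered split: fit on the first 20 observations, test on the last 5.
train <- df[1:20, ]
test <- df[21:25, ]

# Baseline: log-linear model fitted on the training period only.
baseline <- lm(log(confirmed) ~ day, data = train)

# Predict on the held-out days and go back to the count scale.
pred <- exp(predict(baseline, newdata = test))

# Root mean squared error on the held-out days.
rmse <- sqrt(mean((pred - test$confirmed)^2))
print(data.frame(day = test$day, observed = test$confirmed, predicted = round(pred)))
print(rmse)
```

A random split would leak future information into the fit, which is why the held-out block is taken from the end of the series.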

If possible I would include something Bayesian, since I understood that they really like this kind of stuff!